{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load and Process Graph Data\n",
    "\n",
    "In this session, you will learn:\n",
    "\n",
    "* Load graph data stored in CSV files.\n",
    "* Construct a graph in DGL.\n",
    "* Query structural information of a DGL graph.\n",
    "* Load and pre-process node and edge features.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load graph data from CSV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comma Separated Values (CSV) is a widely-used format for storing relational data. In this tutorial, we have prepared two CSV files that store [the Zachery's Karate Club network](https://en.wikipedia.org/wiki/Zachary%27s_karate_club).\n",
    "* The `nodes.csv` stores every club members and their attributes.\n",
    "* The `edges.csv` stores the pair-wise interactions between two club members."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls -lh 'data'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use `pandas` to load the two CSV files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "nodes_data = pd.read_csv('data/nodes.csv')\n",
    "print(nodes_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "edges_data = pd.read_csv('data/edges.csv')\n",
    "print(edges_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then construct a graph where each node is a club member and each edge represents their interactions. In DGL, **nodes are consecutive integers starting from zero**. Thus, when preparing the data, it is important to re-label or re-shuffle the row order so that the first row corresponding to the first nodes, so on and so forth.\n",
    "\n",
    "In this example, we have already prepared the data in the correct order, so we can create the graph by the `'Src'` and `'Dst'` columns from the `edges.csv` table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import dgl\n",
    "\n",
    "src = edges_data['Src'].to_numpy()\n",
    "dst = edges_data['Dst'].to_numpy()\n",
    "\n",
    "# Create a DGL graph from a pair of numpy arrays\n",
    "g = dgl.graph((src, dst))\n",
    "\n",
    "# Print a graph gives some meta information such as number of nodes and edges.\n",
    "print(g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A DGL graph can be converted to a `networkx` graph, so to utilize its rich functionalities such as visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "# Since the actual graph is undirected, we convert it for visualization\n",
    "# purpose.\n",
    "nx_g = g.to_networkx().to_undirected()\n",
    "# Kamada-Kawaii layout usually looks pretty for arbitrary graphs\n",
    "pos = nx.kamada_kawai_layout(nx_g)\n",
    "nx.draw(nx_g, pos, with_labels=True, node_color=[[.7, .7, .7]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query graph structures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's print out how many nodes and edges are there in this graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('#Nodes', g.number_of_nodes())\n",
    "print('#Edges', g.number_of_edges())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also perform queries on the graph structures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the in-degree of node 0:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g.in_degree(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the successors of node 0:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g.successors(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DGL provides APIs for querying structural information. See the API document [here](https://docs.dgl.ai/api/python/heterograph.html#querying-graph-structure)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image.png](../asset/dgl-query.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load node and edge features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In many graph data, nodes and edges have attributes. Although these attributes can have arbitrary types, a DGL graph only accepts attributes stored in tensors (with numerical contents). The vast development of deep learning has provided us many ways to vectorize various types of attributes into numerical features. Here are some general suggestions:\n",
    "* For categorical attributes (e.g. gender, occupation), consider converting them to integers or one-hot encoding.\n",
    "* For variable length string contents (e.g. news article, quote), consider applying a language model.\n",
    "* For images, consider applying a vision model such as CNNs.\n",
    "\n",
    "Our data set has the following attribute columns:\n",
    "* `Age` is already an integer attribute.\n",
    "* `Club` is a categorical attribute representing which community each member belongs to.\n",
    "* `Weight` is a floating number indicating the strength of each interaction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn.functional as F\n",
    "\n",
    "# Prepare the age node feature\n",
    "age = torch.tensor(nodes_data['Age'].to_numpy()).float() / 100\n",
    "print(age)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the features of node 0 and 10\n",
    "age[[0, 10]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `g.ndata` to set the age features to the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feed the features to graph\n",
    "g.ndata['age'] = age\n",
    "print(g)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The \"Club\" column represents which community does each node belong to.\n",
    "# The values are of string type, so we must convert it to either categorical\n",
    "# integer values or one-hot encoding.\n",
    "\n",
    "club = nodes_data['Club'].to_list()\n",
    "# Convert to categorical integer values with 0 for 'Mr. Hi', 1 for 'Officer'.\n",
    "club = torch.tensor([c == 'Officer' for c in club]).long()\n",
    "# We can also convert it to one-hot encoding.\n",
    "club_onehot = F.one_hot(club)\n",
    "print(club_onehot)\n",
    "\n",
    "# Use `g.ndata` like a normal dictionary\n",
    "g.ndata.update({'club' : club, 'club_onehot' : club_onehot})\n",
    "# Remove some features using del\n",
    "del g.ndata['age']\n",
    "\n",
    "print(g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feeding edge features to a DGL graph is similar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get edge features from the DataFrame and feed it to graph.\n",
    "edge_weight = torch.tensor(edges_data['Weight'].to_numpy())\n",
    "# Similarly, use `g.edata` for getting/setting edge features.\n",
    "g.edata['weight'] = edge_weight\n",
    "print(g)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
